Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs
Parallel sentences are a relatively scarce but extremely useful resource for
many applications including cross-lingual retrieval and statistical machine
translation. This research explores our methodology for mining such data from
previously obtained comparable corpora. The task is highly practical since
non-parallel multilingual data exist in far greater quantities than parallel
corpora, but parallel sentences are a much more useful resource. Here we
propose a web crawling method for building subject-aligned comparable corpora
from Wikipedia articles. We also introduce a method for extracting truly
parallel sentences by filtering them from noisy or merely comparable sentence
pairs. We describe our implementation of a specialized tool for this task as
well as the training and adaptation of a machine translation system that
supplies our filter with additional information about the similarity of
comparable sentence pairs.
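The filtering idea described above can be illustrated with a minimal sketch.
It is not the authors' tool: the similarity measure (token-level matching via
Python's difflib), the 0.7 threshold, and the names toy_translate, similarity
and filter_parallel are all assumptions made for illustration. The toy lookup
table only stands in for a trained machine translation system.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for a trained MT system: a fixed lookup table.
TOY_MT = {
    "Der Hund spielt im Garten.": "The dog plays in the garden.",
    "Berlin ist die Hauptstadt von Deutschland.": "Berlin is the capital of Germany.",
}

def toy_translate(sentence):
    """Return the canned translation, or the input unchanged if unknown."""
    return TOY_MT.get(sentence, sentence)

def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()").lower() for tok in sentence.split())
    return [tok for tok in tokens if tok]

def similarity(mt_output, target):
    """Token-level similarity in [0, 1] between MT output and a candidate."""
    return SequenceMatcher(None, tokenize(mt_output), tokenize(target)).ratio()

def filter_parallel(candidates, translate, threshold=0.7):
    """Keep (source, target, score) triples whose score reaches the threshold.

    The threshold value is an assumption; in practice it would be tuned.
    """
    kept = []
    for source, target in candidates:
        score = similarity(translate(source), target)
        if score >= threshold:
            kept.append((source, target, score))
    return kept

# Comparable sentence pairs: the first is close to parallel, the second is not.
CANDIDATES = [
    ("Der Hund spielt im Garten.", "The dog is playing in the garden."),
    ("Berlin ist die Hauptstadt von Deutschland.", "Berlin has many museums and parks."),
]

parallel_pairs = filter_parallel(CANDIDATES, toy_translate)
```

Translating the source side first lets the comparison run within a single
language, which is one plausible way a translation system can supply
similarity information to such a filter.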
Shallow reading with Deep Learning: Predicting popularity of online content using only its title
With the ever-decreasing attention span of contemporary Internet users, the
title of online content (such as a news article or video) can be a major factor
in determining its popularity. To take advantage of this phenomenon, we propose
a new method based on a bidirectional Long Short-Term Memory (LSTM) neural
network designed to predict the popularity of online content using only its
title. We evaluate the proposed architecture on two distinct datasets of news
articles and news videos distributed in social media that contain over 40,000
samples in total. On those datasets, our approach improves the performance over
traditional shallow approaches by a margin of 15%. Additionally, we show that
using pre-trained word vectors in the embedding layer improves the results of
LSTM models, especially when the training set is small. To our knowledge, this
is the first attempt at predicting popularity using only textual information
from the title.
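To make the architecture concrete, the following is a minimal sketch of the
forward pass of a bidirectional LSTM over the words of a title, written with
the Python standard library only. It is not the authors' model: the weights
are random and untrained, the per-word vectors are pseudo-random stand-ins
for pre-trained word vectors, the sigmoid output score is an assumption, and
all names (LSTMCell, TitleBiLSTM, encode, predict) and sizes are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class LSTMCell:
    """A single LSTM cell with input (i), forget (f), output (o) gates
    and a candidate cell update (c). Weights are random, not trained."""

    def __init__(self, rng, input_size, hidden_size):
        self.hidden_size = hidden_size
        # One weight matrix per gate, applied to the concatenation [x; h].
        self.w = {g: rand_matrix(rng, hidden_size, input_size + hidden_size)
                  for g in "ifoc"}
        self.b = {g: [0.0] * hidden_size for g in "ifoc"}

    def step(self, x, h, c):
        z = x + h  # list concatenation of input and previous hidden state
        pre = {g: [a + b for a, b in zip(matvec(self.w[g], z), self.b[g])]
               for g in "ifoc"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        cand = [math.tanh(v) for v in pre["c"]]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, cand)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

    def run(self, inputs):
        """Process a sequence and return the final hidden state."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in inputs:
            h, c = self.step(x, h, c)
        return h

class TitleBiLSTM:
    def __init__(self, embed_size=8, hidden_size=4, seed=0):
        rng = random.Random(seed)
        self.embed_size = embed_size
        self.forward_cell = LSTMCell(rng, embed_size, hidden_size)
        self.backward_cell = LSTMCell(rng, embed_size, hidden_size)
        self.out_w = [rng.uniform(-0.1, 0.1) for _ in range(2 * hidden_size)]
        self.out_b = 0.0

    def embed(self, word):
        # Assumption: a deterministic pseudo-random vector per word,
        # standing in for a lookup into pre-trained word vectors.
        rng = random.Random(word.lower())
        return [rng.uniform(-1.0, 1.0) for _ in range(self.embed_size)]

    def encode(self, title):
        vectors = [self.embed(word) for word in title.split()]
        forward_state = self.forward_cell.run(vectors)          # left to right
        backward_state = self.backward_cell.run(vectors[::-1])  # right to left
        return forward_state + backward_state                   # concatenation

    def predict(self, title):
        features = self.encode(title)
        logit = sum(w * x for w, x in zip(self.out_w, features)) + self.out_b
        return sigmoid(logit)  # score in (0, 1); meaningless until trained

model = TitleBiLSTM()
score = model.predict("Ten things you did not know about your cat")
```

Reading the title in both directions gives the output layer a summary that
depends on both the beginning and the end of the title, which is the property
that distinguishes a bidirectional LSTM from a unidirectional one.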